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Modeling species abundance patterns using local environmental 
features is an important, current problem in ecology. The Cape Floris- 
tic Region (CFR) in South Africa is a global hot spot of diversity and 
endemism, and provides a rich class of species abundance data for 
such modeling. Here, we propose a multi-stage Bayesian hierarchical 
model for explaining species abundance over this region. Our model 
is specified at areal level, where the CFR is divided into roughly 
37,000 one minute grid cells; species abundance is observed at some 
locations within some cells. The abundance values are ordinally cat- 
egorized. Environmental and soil-type factors, likely to influence the 
abundance pattern, are included in the model. We formulate the em- 
pirical abundance pattern as a degraded version of the potential pat- 
tern, with the degradation effect accomplished in two stages. First, 
we adjust for land use transformation and then we adjust for mea- 
surement error, hence misclassification error, to yield the observed 
abundance classifications. An important point in this analysis is that 
only 28% of the grid cells have been sampled and that, for sampled 
grid cells, the number of sampled locations ranges from one to more 
than one hundred. Still, we are able to develop potential and trans- 
formed abundance surfaces over the entire region. 

In the hierarchical framework, categorical abundance classifica- 
tions are induced by continuous latent surfaces. The degradation 
model above is built on the latent scale. On this scale, an areal level 
spatial regression model was used for modeling the dependence of 
species abundance on the environmental factors. To capture antic- 
ipated similarity in abundance pattern among neighboring regions, 
spatial random effects with a conditionally autoregressive prior (CAR) 
were specified. Model fitting is through familiar Markov chain Monte 
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Carlo methods. While models with CAR priors are usually efficiently 
fitted, even with large data sets, with our modeling and the large 
number of cells, run times became very long. So a novel parallelized 
computing strategy was developed to expedite fitting. The model 
was run for six different species. With categorical data, display of 
the resultant abundance patterns is a challenge and we offer several 
different views. The patterns are of importance on their own, com- 
paratively across the region and across species, with implications for 
species competition and, more generally, for planning and conserva- 
tion. 

1. Introduction. Ecologists increasingly use species distribution models 
to address theoretical and practical issues, including predicting the response 
of species to climate change [Midgley and Thuiller (2007), Fitzpatrick et 
al. (2008), Loarie et al. (2008)], designing and managing conservation areas 
[Pressey et al. (2007)], and finding additional populations of known species or 
closely related sibling species [Raxworthy et al. (2003), Guisan et al. (2006)]. 
In all these applications the core problem is to use information about where 
a species occurs and about relevant environmental factors to predict how 
likely the species is to be present or absent in unsampled locations. 

The literature on species distribution modeling covers many applications; 
there are useful review papers that organize and compare model approaches 
[Guisan and Zimmerman (2000), Guisan and Thuiller (2005), Elith et al. 
(2006), Graham and Hijmans (2006), Wisz et al. (2008)]. Most species dis- 
tribution models ignore spatial pattern and thus are based implicitly on 
two assumptions: (1) environmental factors are the primary determinants of 
species distributions and (2) species have reached or nearly reached equilib- 
rium with these factors [Schwartz et al. (2006), Beale et al. (2007)]. These 
assumptions underlie the currently dominant species distribution model- 
ing approaches — generalized linear and additive models (GLM and GAM), 
species envelope models such as BIOCLIM [Busby (1991)], and the maxi- 
mum entropy-based approach MAXENT [Phillips and Dudi'k (2008)]. The 
statistics literature covers GLM and GAM models extensively. The latter 
tends to fit data better than the former since they employ additional pa- 
rameters but lose simplicity in interpretation and risk overfitting and poor 
out-of-sample prediction. Climate envelope models and the now increasingly- 
used maximum entropy methods are algorithmic and not of direct interest 
here. 

In addition to the fundamental ecological issues mentioned above, compli- 
cation arises in various forms in modeling abundance from imperfect survey 
data such as observer error [Royle et al. (2007), Cressie et al. (2009)], variable 
sampling intensity, gaps in sampling, and spatial misalignment of distribu- 
tional and environmental data [Gelfand et al. (2005a)]. First, since a region 
is almost never exhaustively sampled, individuals not exposed to sampling 
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will be missed. Second, it may be that potentially present individuals are 
undetected [Royle et al. (2007)] and, possibly vice versa, for example, a false 
positive misclassiflcation error with regard to species detection [Royle and 
Link (2006)]. A third complication is that ecologists and field workers are 
biased against absences; they tend to sample where species are, not where 
they aren't. Such preferential sampling and its impact on inference is dis- 
cussed in Diggle, Menezes and Su (2010). Further complications arise with 
animals due to their mobility. Previous work has developed spatial hier- 
archical models that accommodate some of these difficulties, fitting these 
models to presence/absence data in a Bayesian framework [Hooten, Larsen 
and Wikle (2003), Gelfand et al. (2005a, 2005b), Latimer et al. (2006)]. 

The species distribution modeling discussed above is either in the pres- 
ence/absence or presence-only data settings; there is relatively little work 
on spatial abundance patterns, despite their theoretical and practical im- 
portance [Kunin, Hartley and Lennon (2000), Gaston (2003)]. Our primary 
contribution here is to develop a hierarchical modeling approach for ordi- 
nal categorical abundance data, explained by the suitability of the environ- 
ment, the effect of land use/land transformation, and potential misclassiflca- 
tion error. Ordinal classifications are often the case in ecological abundance 
data, especially for plants [Sutherland (2006), Ibanez et al. (2009)]. From 
a stochastic modeling perspective, categorical data can be viewed as the 
outcome of a multinomial model, with the cell probabilities dependent on 
background features. Within a Bayesian framework such modeling is often 
implemented using data augmentation [Albert and Chib (1993)], introduc- 
ing a latent hierarchical level. There, the ordered classification is viewed 
as a clipped version of a single latent continuous response, introducing cut 
points. See also De Oliveira (2000) and Higgs and Hoeting (2010). 

At the latent level, suitability of the environment can be modeled through 
regression. Availability in terms of land use degrades suitability. That is, an 
important feature of our modeling, from an ecological point of view, is that 
it deals with transformation of the study area by human intervention. In 
much of the region, the "natural" state of areas has been altered to an 
agricultural or urban state, or the vegetation has been densely colonized 
by alien invasive plant species. So, we cannot treat the entire region as 
equally available to the plant species we are modeling. We must introduce 
a contrast between the current abundance of species (their transformed or 
adjusted abundance) and their potential distributions in the absence of land 
use change (potential abundance). These notions are formally defined at the 
areal unit level in Section 3. A further degradation enabling the possibility 
of misclassification and/or observer error in the data collection procedure 
can be accounted for as measurement error in the latent surface. There 
is a substantial literature on measurement error modeling for continuous 
observations, for example, Fuller (1987), Stefanski and Carroll (1987), and 
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Mallick and Gelfand (1995). In our modeling we impose a hard constraint: 
with no potential presence (i.e., an unsuitable environment), we can observe 
only zero abundance. We enforce this constraint on the latent scale. With 
cut points, modeled as random, we provide an explanatory model for the 
observed categorical abundance data. Furthermore, we can invert from the 
latent abundance scale to the categorical abundances to predict abundance 
for unsampled cells and also to predict abundance in the absence of land use 
transformation. 

With spatial data collection, we anticipate spatial pattern in abundance 
and thus introduce spatial structure into our modeling. That is, causal eco- 
logical explanations such as localized dispersal, as well as omitted (unob- 
served) explanatory variables with spatial pattern such as local smoothness 
of geological or topographic features, suggest that, at sufficiently high resolu- 
tion, abundance of a species at one location will be associated with its abun- 
dance at neighboring locations [Ver Hoef et al. (2001)]. Moreover, through 
spatial modeling, we can provide spatial adjustment to cells that have not 
been sampled, accommodating the gaps in sampling and irregular sampling 
intensity mentioned above. In particular, we create a latent process model 
through a trivariate spatial process specification, with truncated support, 
to capture potential abundance, land transformation- adjusted abundance, 
and measurement error-adjusted abundance. Since our environmental infor- 
mation is available at grid cell level, we use Markov random field (MRF) 
models [Besag (1974), Banerjee, Carlin and Gelfand (2004)] to capture spa- 
tial dependence and to facilitate computation. However, we work with a large 
landscape of approximately 37,000 grid cells which leads to very long run 
times in model fitting and so we introduce a novel parallelized computing 
strategy to expedite fitting. 

There have been other recent developments in modeling of species abun- 
dances, some using Bayesian hierarchical models. First, there has been some 
work on developing models that deal almost exclusively with animal census 
data, including count data and mark-recapture data [Royle et al. (2007), 
Conroy et al. (2008), Gorresen et al. (2009)]. Potts and Elith (2006) provide 
an overview of abundance modeling, in fact, five regression models (Poisson, 
negative binomial, quasi-Poisson, the hurdle model, and the zero-inflated 
Poisson) fitted for one particular plant example. These models focus on cor- 
recting observer error and bias as well as under-detection (the species is 
present but not observed), whence the "true" abundance is virtually always 
higher than observed [Royle et al. (2007), Cressie et al. (2009)]. We note 
some very recent work on working with ordinal species abundance in plant 
data by Ibahez et al. (2009). This approach takes ordinally scored abun- 
dances and uses an ordered logit hierarchical Bayes model to infer potential 
abundances for species that are still spreading across the landscape. 
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The advantages of working within the Bayesian framework with Markov 
chain Monte Carlo (MCMC) model fitting are familiar by now — full infer- 
ence about arbitrary unknowns, that is, functions of model parameters and 
predictions, can be achieved through their posterior distributions, and un- 
certainty can be quantified exactly rather than through asymptotics. In this 
application we work with the disaggregated data at the level of individual 
species and sites to present spatially resolved abundance "surfaces" and to 
capture uncertainty in model parameters. Doing this turns out to be more 
difficult than might be expected, as we reveal in our model development 
section. The key modeling issues center on careful articulation of the defi- 
nition of events and associated probabilities, the misalignment between the 
sampling for abundance (at the relatively small sampling sites) and the avail- 
able environmental data layers (at a scale of minute by minute grid cells, 
roughly 1.55 km x 1.85 km over the region), the sparseness of observations 
in terms of the entire landscape (with uneven sampling intensity including 
many "holes" ) , the occurrence of considerable human intervention with re- 
gard to land use across the landscape ("transformation"), and the need for 
spatially explicit modeling. 

The format of the paper is as follows. Section 2 describes the motivating 
data set. Section 3 develops the multi-level abundance model. Section 4 
details the computational and inference issues. In Section 5 we present an 
analysis of the data from the Cape Floristic Region (CFR) and conclude 
with some discussion and future extensions in Section 6. 

2. Data description. The focal area for this abundance study is the Cape 
Floristic Kingdom or Region (CFR), the smallest of the world's six flo- 
ral kingdoms (Figure 1). As noted above, it encompasses a small region 
of southwestern South Africa, about 90,000 km 2 , including the Cape of 
Good Hope, and is partitioned into 36,907 minute-by-minute grid cells of 
equal area. It has long been recognized for high levels of plant species diver- 
sity and endemism across all spatial scales. The region includes about 9000 
plant species, 69% of which are found nowhere else [Goldblatt and Manning 
(2000)]. Globally, this is one of the highest concentrations of endemic plant 
species in the world. It is as diverse as many of the world's tropical rain 
forests and apparently has the highest density of globally endangered plant 
species [Rebelo (2002)]. The plant diversity in the CFR is concentrated in 
relatively few groups, such as the icon flowering plant family of South Africa, 
the Proteaceae. We focus on this family because the data on species distri- 
bution and abundance patterns are sufficiently rich and detailed to allow 
complex modeling. The Proteaceae have also shown a remarkable level of 
speciation, with about 400 species across Africa, of which 330 species are 
99% restricted to the CFR. Of those 330 species, at least 152 are listed as 
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"threatened" with extinction by the International Union for the Conserva- 
tion of Nature. Proteaceae have been unusually well sampled across the re- 
gion by the Protea Atlas Project of the South African National Biodiversity 
Institute [Rebelo (2001)]. Data were collected at record localities: relatively 
uniform, geo-referenced areas typically 50 to 100 m in diameter. In addition 
to the presence (or absence) at the locality of protea species, abundance of 
each species along with selected environmental and species-level information 
were also tallied [Rebelo (1991)]. To date, some 60,000 localities have been 
recorded (including null sites), with a total of about 250,000 species counts 
from among some 375 proteas [Rebelo (2006)]. 

Abundance is given for a sampling locality. Evidently, there is no notion 
of abundance at a point; however, with roughly 60,000 sites sampled over 
the entire CFR, the relative scale of the Protea Atlas observations is small 
enough when compared to our areal units to be considered as "points." In 
the literature, abundance is sometimes measured as percent cover [Mueller- 
Dombois and Ellenberg (2003)]. In our data set, abundance is recorded as 
an ordinal categorical classification of the count for the species with four 
categories: category 0: none observed, category 1: 1-10 observed, category 
2: 11-100 observed, category 3: >100 observed. Such categorization is fast 
and efficient for studying many species and many sampling locations but 
is certainly at risk for measurement error in the form of misclassification. 
Additionally, a large number of cells were not sampled at all. In fact, only 
10,158, that is, 28%, were sampled at one or more sites. Even among cells 
sampled, some have just one or two sites while others have more than 100, 
reflecting the irregular and opportunistic nature of the sampling rather than 
an experimentally designed sampling plan. 
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Fig. 1. Location of the Cape Floristic Region (CFR) of South Africa. Inset shows the 
location of the CFR within the African Continent. The 90,000 km 2 region was divided into 
36,907 1-minute cells for modeling. 



SPATIAL MODELING FOR SPECIES ABUNDANCE 



7 



Turning to the covariates, in Gelfand et al. (2005a, 2005b) 16 explana- 
tory environmental variables were studied, capturing climate, soil, and topo- 
graphic features (further detail is provided there). Here, we confine ourselves 
to the six most significant variables from that study, which are Evapotran- 
spiration (APAN.MEAN), July (winter) minimum temperature (MIN07), 
January (summer) maximum temperature (MAX01), mean annual precip- 
itation (MEAN. AN. PR), summer soil moisture days (SUMSMD), and soil 
fertility (FERT1). Transformed areas (by agriculture, reforestation, alien 
plant infestation, and urbanization) were obtained as a GIS data layer from 
R. Cowling (private communication). Figure 2 shows the pattern of trans- 
formation across the CFR. Approximately 1/3 of the Cape has been trans- 
formed, mainly in the lowlands on more fertile soils where rainfall is ad- 
equate [Rouget et al. (2003)]. Most of the transformation outside of these 
areas, on the infertile mountains, is due to dense alien invader species, which 
are currently a major threat to Fynbos vegetation and, in particular, to the 
Proteaceae. 

3. Multi-level latent abundance modeling. In Section 3.1 we briefly re- 
view the earlier work on hierarchical modeling for presence/absence data, 
presented in Gelfand et al. (2005a, 2005b), in order to reveal how we have 
generalized it for the abundance problem. Section 3.2 develops our proposed 
probability model for the categorical abundance data. In Section 3.3 discrete 
probability distributions are replaced using latent continuous variables. In 
Section 3.4 we discuss bias issues associated with modeling abundance data 
and, in particular, how they affect our setting. Section 3.5 deals with explicit 
model details for the likelihood and posterior. 




Fig. 2. Proportion of untransformed land inside the CFR. Most of the transformation is 
due to agriculture, but includes dense stands of alien invasive species. 
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3.1. Hierarchical presence/absence modeling. In Gelfand et al. (2005a, 
2005b) the authors model at the scale of the grid cells in the CFR and provide 
a block averaged binary process presence/absence model at this scale. In 
particular, let Ai C M 2 denote the geographical region corresponding to the 

(k) 

ith. grid cell and X> the event that a randomly selected location within Ai 
is suitable (1) or unsuitable (0) for species k. Set P{x\ k) = 1) = pf\ Then, 
Pi is conceptualized by letting (s) be a binary process over the region 
indicating the suitability (1) or not (0) of location s for species k and taking 

(k) 

p\ to be the block average of this process over unit i. That is, 

(3.1) P^ = jh[ X(k) ^ ds =lh [ U^ ik \s) = l)d S , 

I 1\ u j4j | 2 1 u Ai 

where \Ai\ denotes the area of Ai. From Equation (3.1), the interpretation 
is that the more locations within Ai with \( k \s) = 1, the more suitable Ai 
is for species k, that is, the greater the chance of potential presence in Ai. 

(k) 

The collection of p\ 's over the A, is viewed as the potential distribution of 
species k. 

(k) 

Let denote the event that a randomly selected location in Ai is suit- 
able for species k in the presence of transformation of the landscape. Let 
T(s) be an indicator process indicating whether location s is transformed 
(1) or not (0). Then, at s, both T(s) = (availability) and A (fc) (s) = 1 (suit- 
ability) are needed in order that location s is suitable under transformation. 
Therefore, 

(3.2) P(y/ fe) = 1) = -L f l(T(s) = 0)l(\W(s) = l)ds. 

\ A i\ J A, 

If, for each pixel, availability is uncorrelated with suitability, then Equation 
(3.2) simplifies to P(v} k) = 1) = Uip\ , where Ui denotes the proportion of 
area in Ai which is untransformed, < Ui < 1. 

Next, assume that Ai has been visited rii times in untransformed areas 
within the cell. Further, let y^j be the observed presence/absence status 
of the fcth species at the jth sampling location within the ith unit. The 

yii |Vj = 1 are modeled as i.i.d. Bernoulli trials with success probability 
q\ , that is, for a randomly selected location in Ai, is the probability 
of species k being present given the location is both suitable and available. 

Of course, given Vi(k) = 0, y {k) = with probability 1. Then, we have that 

P(«W = 1) = ql k) Uip\ k) . Gelfand et al. (2005a, 2005b) model the pf ) and 

( k) 

g- using logistic regressions. In fact, they use environmental variables and 

(k) 

spatial random effects to model the p\ s, the probabilities of potential 
presence, and, to facilitate identifiability of parameters, use species level 

(k) 

attributes to model the q\ s. 
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3.2. Probability model for categorical abundance. We first define what 
categorical abundance means at an areal scale using the four ordinal cate- 
gories from Section 2. Suppressing the species index, let X{ denote the clas- 
sification for a randomly selected location in Ai and define p^, = P(Xi = h) 
for h = 0, 1, 2, 3. If A(s) is a four-colored process taking values 0, 1, 2, 3, then 
Pih = no £4. = h)ds. That is, p^ is the proportion of area within 

Ai with color h, equivalently, the proportion in abundance class h. The 
Pih denote the potential abundance probabilities, that is, in the absence of 
transformation. 

We describe land transformation percentage (1 — Uj) as a block average 
of a binary availability process T(s) over Ai. It can also be interpreted as 
the probability that a randomly selected site within Ai is transformed. At a 
transformed location abundance must be 0. Thus, as in Equation (3.2), in the 
presence of transformation, we revise p^ to Pr(Xi = h) = f A . l(T(s) = 
0)l(A(s) = h)ds. Under independence of abundance and land transforma- 
tion, we obtain Pr{Xi = h) = UiPih. The UiPih denote the transformed abun- 
dance probabilities for h ^ 0. The probability of abundance class under 
transformation is evidently 1 — Ui + UiPiQ. Let denote the abundance class 
probabilities in the presence of transformation. 

Finally, suppose there is an observed categorical abundance at location j 
within Ai , say, yij . There is an associated conceptual Xij and an observed Tij . 
Then, Xij 7^ KjTij if there has been transformation degradation at location 
j, unless Xij = 0. Furthermore, if there has been a misclassification error at j, 
Uij 7^ ^ijTij unless Xij = 0. Let denote the abundance class probabilities 
associated with the observed abundances. In Section 3.3 we specify a latent 
trivariate continuous abundance model that produces the p's, r's, and g's 
by integrating over appropriate intervals. This latent model can be viewed 
as the process model for our setting. 

The data set consists of observed abundances across several sampling 
sites within the CFR. Let D denote our CFR study domain so D is divided 
into / = 36,907 grid cells of equal area. For each cell i = 1,2,3, ... , J, we 
are given information on p covariates as Vi = (vn, v i%, . . . , Vi p ). Within Ai, 
the abundance category of a species was recorded at each of ni sampling 
sites. For many cells rn > 1. For site j in Ai we observe yij as a multinomial 

trial, that is, mult({gj^}), j = 1, 2, . . . , m. We have a large number of 

unsampled cells, that is, n, = 0. In fact, out of 36,907 cells, only m = 10,158 
(28%) were sampled at one or more sites. Figure 3 indicates locations of 
sampled cells. For the unsampled cells there are no yjj's in the data set. 
Hence, the inference problem involves estimation of probabilities over the 
observed cells as well as prediction over the unsampled region. Prediction of 
a categorical response distribution for unsampled locations in a point level 
model was discussed in De Oliveira (2000) and Higgs and Hoeting (2010). 
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Fig. 3. Cells within the CFR that have at least one observation from the Protea Atlas 
data set are shown in light grey, while cells with no observations are shown in dark grey. 

In our areal setup with only areal level t?'s, we address this problem with a 
MRF model, as described in Section 3.3. Again, we seek to infer about the 
p's, r's, and g's given the observed y's for a subset of cells and u's known 
for all cells. 

3.3. Latent continuous abundances. It is now common to model the prob- 
ability mass function of a scalar ordinal categorical variable through an 
underlying univariate continuous distribution. In a more general setup, Le 
Loc'h and Galli (1997) and Armstrong et al. (2003) used latent random vec- 
tors to define the categorical probabilities in terms of these vectors taking 
values within a specific set. In a similar spirit, corresponding to an observed 
abundance category variable yij, we introduce a continuous latent variable 
zo,ij such that 



where a = (a_i = — oo,ao = 0,ai,a2,a3 = oo) are an increasing sequence 
of cut points. For identifiability and without loss of generality, we can set 
ao = and interpret zo,ij < as an absence, zo,ij > as a presence. We 
have P(yij = h) = q ih = P(z ,ij € (a h -i, a h )). So q ih will be determined by 
the probability model specified for the zo^j's. We will introduce spatial 
dependence between zo,r/'s below but, for now, to simplify notation, we 
drop the subscript. 

A simple model would put a Gaussian distribution on these latent zo's 
whose means are linear functions of the associated v 's. This would provide a 
routine categorical regression model but ignores known land transformation 
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and potential measurement/ecological error in the recorded abundance cat- 
egories. Instead, we introduce zp^j to provide the py's and zp,ij to provide 
the 7y's. We need a joint distribution to relate the zp, zt, and zo- From 
a process perspective in terms of the proposed degradation, it seems nat- 
ural to specify this distribution in the form f (zp) f (zp\zp) f (zo\zp) . Since 
(zp, zt, zo) capture the sequential degradation of an associated categori- 
cal abundance distribution, we need to use the same set of a's to produce 
meaningful (p,r,q) respectively. Now, we propose (and clarify) the following 
dependence structure. Define c(/i) = fi— where <f>{-) and $(•) are the 

standard normal p.d.f. and c.d.f. respectively. Note that c(ju) = E(V\V ~ 
N(ji, 1), V < 0) so c(fi) < min(0,/i) for all /i € R. Let 



Again, the conditional modeling above is motivated by the degradation 
perspective. To model the latent zp surface, we use the covariate informa- 
tion, that is, climate and soil features that are believed to influence the 
abundance distribution of different species in different ways. We also add a 
spatial random effect (6) in the mean function to account for spatial associ- 
ation that may arise from factors, apart from included covariates, that may 
have a spatial pattern. The covariate effects f3 as well as the spatial random 
effects 6 are species-specific. Variances are fixed at 1 for identifiability (see 
Section 3.5). Since we are working at areal scale, we assign each cell a single 
9 with the prior on 9± 2,... I specified using a Gaussian Markov random field 
(MRF) [Besag (1974)] with first-order adjacency proximities. See Banerjee, 
Carlin and Gelfand (2004) for details as well as further references. 

Next, the zp surface is degraded by land transformation. A random lo- 
cation inside A{ is untransformed with probability u%. Then, zp = zp, that 
is, a degenerate distribution at zp given zp. If it is transformed, the degra- 
dation occurs so that the zp corresponds to the zero abundance category. 
For simplicity (with further discussion below), we make this a degenerate 
distribution at c(zp) < 0, whence zp\zp,u becomes a two point distribution 
as above. Again, transformation is equivalent to absence and since ao = is 
the upper threshold for that classification, we need zp < for a transformed 
location. When a cell is completely transformed, from Equation (3.3) we 
have zp < w.p. 1. For u = 1 (complete availability) zp and zp are the 
same. For any < u < 1, we get E(zp\zp) = uzp + (1 — u)c(zp). 

Also, since c(x) < x, E(zp\zp) < zp, which is essential in the sense that 
transformation can only degrade abundance [and clarifies our choice for c(-)]. 



(3.3) 



P(y = h\z ,u) 
f(zo\zp) 
f(z T \z P ,u) 

f(zp\v,(3,T 2 ) 



l(a/,-i < zo < a h ); 0<h<3 
^(^o; zt, l)lz T >0 + S Zt 1 Zt <q, 
u5 zp + (1 - u)6 c ( Zp ), 
<P(zp;v T (3 + e,l). 
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Posterior summaries of zp measure the prevailing abundance under trans- 
formation within the CFR. [In Appendix A.l we show that |£J(2;t)| < oo.] 
The two-point mixture distribution also implies the probability of abun- 
dance class 0, P(zp < 0) > (1 — u), that is, no matter how large the poten- 
tial abundance is within a cell, for any u < 1 there is a positive probability 
that transformed abundance may fall below at a random location within 
the cell. Other choices for the zp\zp specification besides a point mass at 
c(zp) include putting a point mass at some arbitrary point c < 0, or using a 
truncated normal zp\zp on R - . In the first case, it is not ensured whether 
zp < zp (it depends on whether there are cells with zp < c), while the second 
choice adds complication for no benefit, is less interpretable, and does not 
ensure zp < zp with probability 1. Also, in Section 4 we show that, in terms 
of fitting the model, the specification used in Equation (3.3) is the same as 
using a truncated normal distribution for land transformation. 

Next, we modify the {zp} surface to produce {zo}- With regard to mea- 
surement error, the recorded category of abundance at a particular location 
can be different from the prevailing category due to inaccuracy in field as- 
sessment of species quantity. However, we assume that when the potential 
abundance was zero, one could not record a nonzero abundance category for 
it [no false positives, see Royle and Link (2006) in this regard]. This puts 
a directional constraint on the effect of noise. A specification for f(zp\zo) 
which is coherent with this restriction has, with zp > (i.e., a presence), 
zo\zp ~ N(zp, 1). This is a usual measurement error model (MEM) specifi- 
cation. For a site with no presence zp < 0, our assumption says there cannot 
be any measurement error, thus, in Equation (3.3), for simplicity, we set zo 
to be the same as zp. Again, other choices of zo\zp can be considered for 
the zp < event, but they will not have any impact on estimation of the 
zp surface, as we clarify in Section 4. This sequential dependence structure, 
zp — >• zp — > zo, implies that if zp < so is zp and zo- Hence, if a site is 
not suitable for a species, at no intermediate stage of the model can the site 
have any positive probability of species occurrence. A change in category be- 
tween actual and observed arises when the noise pushes zp to the other side 
of some cut point to produce zq. And, because of the truncation structure, 
that shift cannot happen from the left of ao = to the right. 

An alternative way to jointly model (zo,zp) could use a bivariate normal 
distribution with support truncated to R 2 — [0, oo) x (— oo, 0]. However, this 
specification fails to produce an f(zo\zp) which match our intuition about 
how the degradation took place. Also, from the distributional perspective, 
the truncated normal redistributes the mass contained inside the left-out 
region uniformly across the support, whereas the specification in Equation 
(3.3) shifts the mass only to (zo < 0), which is more in agreement with 
modeling a data set such as ours where we have an inflated number of 
reported zero abundances. 
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The simple dependence structure for zt\zp allows us to marginalize over 
zt and work with zp and zo\zp as our joint latent distribution. We have 



Rewriting Equation (3.4) in a simpler form, we get 

(3.5) f(z \z P ) ~ u[(p(z ; z P ,l)l Zp > + 5 Zp l Zp < ] + (l-u)6 c ( zp y 

Moreover, Equation (3.5) has a nice interpretation in the sense that, first, it 
indicates whether the land is transformed or not with probability 1 — u. If the 
land is transformed, it sets observed abundance to be zo = c(zp) < zp. In 
the case of available land, if there is a potential presence, it allows for a MEM 
around zp; in the case of absence, it stays fixed at zp. Since zo is related 
to the observed data and zp is our surface of interest, the marginalization 
removes one stage of hierarchy from our model fitting and thus reduces corre- 
lation, yielding better behaved MCMC in model fitting. Furthermore, we can 
retrieve the zt surface after the fact since f(zp\zo, zp) oc / \zp\zp) f \zo\zt) ■ 

3.4. Measurement error and bias issues. In the Introduction we noted 
that measurement error and bias typically occur with ecological survey data. 
It can manifest itself in the form of detection error, spatial coverage bias 
[Royle et al. (2007)], and under-reporting of absences. How do these biases 
arise in our modeling? Noteworthy points here are (i) the difference between 
obtaining abundance as actual counts as opposed to through ordinal classi- 
fications and (ii) what "no abundance" means across our collection of grid 
cells. 

Nondetection bias (i.e., undetected individuals in a sampled location) 
tends to be discussed more with regard to animal abundance [Ver Hoef 
and Frost (2003), Royle et al. (2007), Gorresen et al. (2009)]. Using counts, 
evidently observed abundance is at most true abundance; error can occur 
in only one direction. With ordinal counts, the bias is still expected to re- 
flect under-reporting but, depending upon the categorical definitions, will 
be much less frequent and need not be absolutely so. For example, in our 
data set, plant population size is visually estimated and an observation, es- 
pecially of large populations, could potentially have error in either direction. 
In our modeling, "true" abundance is not "potential" abundance. For us, one 
could envision true abundance on the latent scale as a "true" transformed 
abundance, say, zp with measurement error yielding zo- Then, one might 
insist that our measurement error model requires zo < zp. Under our mea- 
surement error formulation using zp, we even allow zo > zp to account for 
potential overestimation of abundance. Evidently, since yo may occasion- 
ally be less than the potential classification yp at that location, we may be 



(3.4) 




zp >0, 
z P < 0. 
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slightly underestimating potential abundance. We don't expect this to be 
consequential and, in any event, with no knowledge about the incidence of 
under- classification in our setting, we have no sensible way to correct for 
this bias. 

Turning to spatial coverage bias (i.e., individuals not exposed to sampling 
will be missed), for us, with only 28% of grid cells sampled, we certainly are 
subject to this. However, the spatial modeling we introduce helps in this 
regard. The mean of zpij is vj (3 + Oi regardless of whether we collected any 
data in A{. So, the regression is expected to find the appropriate level for the 
cell and the spatial smoothing associated with the 9{ is expected to provide 
suitable local adjustment. We could argue that, if sampling of grid cells is 
random, this bias can be ignored. 

Perhaps the most difficult bias to address is the under-sampling of ab- 
sences. This bias counters the previous ones; under-sampling of absences 
will tend to produce over-estimates of potential abundance. In our setting, 
under-sampling of absences is reflected in the decision-making that leads 
to only 28% of cells being sampled, that is, it is not a random 28% that 
have been sampled. Different from spatial coverage bias, in this context, the 
ecologist expresses confidence that the species is not present in some of the 
unsampled cells. If this is so and we were to set some additional abundances 
to 0, this would assert that these "0"s are not nondetects and would di- 
minish potential abundance, opposite to the case of nondetects. Of course, 
in the absence of actual data collection, we would not see any of these 0's 
and would adopt model-based inference regarding potential abundance for 
these cells. In any event, with no explicit knowledge of how sampling sites 
were chosen, we are unable to attempt correction for this bias. Possibly, 
approaches to address the effects of preferential sampling [Diggle, Menezes 
and Su (2010)] could be attempted here. 

3.5. Likelihood and posterior distribution. The posterior distributions of 
interest, p and r, will be constructed in the post MCMC analysis (discussed 
in detail in Section 4.3). From the conditional structure we first write P(y = 
c\zo, ct) = l 2o e(a c _ liac ). So the likelihood function for a single sample y turns 

out to be L(y\z ,a) = l\l =0 l(z o G (a fc _i, a fc )) 1(y=fc) ■ Now f(z \z P ) can be 
written as in Equation (3.5). 

Again, we have I cells with m sampling sites within A^. For each y^ we 
introduce a corresponding zo,ij, and hence a pair of ZT,ij , Zp t ij , to represent 
the event happening at the jth sampling site within A{. We work directly 
with the zo\zp structure. Since we are interested in the areal level abundance 
distribution and have covariates at areal resolution, we assume for fixed i, 

zp-ij N{-\ vj (3 + 9i, 1). It is also assumed that the zo^j's are conditionally 
independent given the Zp^j's. 
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Without loss of generality, re-index cells so that the first m of them are 
sampled and the last I — m are not. The latter cells have no contribution to 
the y column and, hence, no associated zo appears in the likelihood. Using a 
nonspatial model, we would work with a posterior on the domain of sampled 
cells only. But assuming a CAR prior structure with adjacency proximity 
matrix W for the 6 over the whole domain enables us to learn about zp for 
unsampled cells. In summary, the posterior distribution takes the following 
form, up to proportionality, with = (a, (3,6): 

m m 

vr(z P , z , G|y, v, u) oc J J Lfaj \zo,ij,a)f(zo,ij\zp t ij)f(zp i ij\v i ,Q) 
i=ij=i 

X7T(G), 

(3.6) 

tt(G) =7r(a)7r(/3)7r(0), 
n(e 1>2j ... tI ) = CAR(r l0 ,W). 

We turn to the identifiability of the set of parameters under the hierar- 
chical dependent latent structure. First, with a latent continuous process 
yielding an ordinal categorical variable, the mean and scale of the distribu- 
tion can be identified only up to their ratio. In Equation (3.3) the depen- 
dence across (z p ,zt,Zo) is specified through conditional means. Hence, all 
Gaussian distributions there are specified with standard deviation 1. Four 
categories of abundance allow three free probabilities and the corresponding 
four latent surfaces will also have 3 degrees of freedom. As noted above, we 
set ao = with a\, ai as free parameters. 

We also need to ensure that all three z surfaces can be distinguished from 
each other. Since transformation percentage 1 — u is given a priori, it is 
straightforward to separate zp and zt- We turn to the joint distribution for 
zp,zo given as zo\zp ~ N(zp, l),zp ~ N(vj3 + 9, 1). With fixed variances 
and no constraint on measurement error, there would be no need to bring in 
zp; it is redundant, there is no way to distinguish between zp and zo, and 
one can use the marginal zo ~ N{v[i + 6, 2). Now the constraint comes into 
the picture; it makes the zo surface non-Gaussian though the zp surface is. 
The greater the measurement error, the more departure from Gaussianity 
in the marginal distribution of zo- Again, the measurement error cannot 
be estimated on any absolute scale, since the latent z scales are fixed for 
identifiability. It will be controlled by parameters like /3 and 6. To compare 
the relative effect of measurement error across different species, under fixed 
scale parameters, P(zp < 0) is a candidate but other model features can be 
informative as well. 

Finally, the full model specification, described in Equation (3.3), can be 
represented through a graphical model, shown in Figure 4. 
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(3 >■ z P,ij >- z T,ij 5- Zo.ij 



Y Y 

J/p,ij < a > !/o,ij 

Fig. 4. Graphical model for latent abundance specification at site j within cell Aj. z's 
denote latent abundance processes, observed (O), transformed (T), and potential (P); y's 
denote interval- censored abundances, observed (O), and potential (P); u is proportion of 
land untransformed, v 's are covariates, f3 's are regression coefficients, a 's are cut points 
for z scale, and 6 's are spatial random effects. 

4. Posterior computation and inference. Here, we describe how to design 
a computationally efficient MCMC algorithm for the model. We then dis- 
cuss how to summarize the posterior samples to estimate important model 
features. 

4.1. Sampling. Introduction of latent layers, although increasing the pa- 
rameter dimension in the model, makes the posterior full conditionals stan- 
dard and easy to sample from. Our goal is to efficiently estimate components 
of which control potential abundance. We rewrite Equation (3.5) as fol- 
lows: 

f{z \z P ) = u<p(z ;z P , l)l 2p>0 + [uS Zp + (1 - u)S c r z )]l Zp<0 

(4.1) 

+ 1 

and work with Equation (4.1) to implement the computation for the model 
fitting. 

We start with updating all zo,zp using (0W;y,v,u) and then drawing 
components of from their respective posterior full conditionals based on 
Zp(t+\)-> z o,(t+l)- Given the draw from Zp, sampling the components of 
is standard as in almost any spatial regression analysis (see Appendix A. 3). 
For the set of 0's, after sampling them sequentially, we need to "center them 
on the fly" [Besag and Kooperberg (1995), Gelfand and Sahu (1999)]. The 
more challenging part is to update zo,zp\Q. In Albert and Chib (1993) 
the latent variables were sampled in the MCMC from mutually independent 
truncated Gaussian full conditionals, with the support determined by the 
corresponding classification. For our model, the posterior full conditional for 
any zo is 

n(zo\zp,y,u) oc f(z \z P )l(z G (a y -i,a y )). 

We take two different strategies to update zp,zo depending on the ob- 
served y. For any site with nonzero y we have (with oq =0) f(zo,zp\y > 
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0, it) oc (j>(zo\zp)l zoe ( ay _ liCly -)4>(zp)i which amounts to sampling first a uni- 
variate normal Zo y (t+i)\{zp^t),y,ct) truncated within {piy^ay ) and then 
from zp,(t+±) \ zo,(t+i)j® which is also Gaussian [with location (^o,(t+i) + 
/j,( t ')/2 and scale y/l/2 where p*) = v T '/Jw + For a site with ob- 
served y = the case is more complicated, with details provided in Ap- 
pendix A. 2. All of the sampling distributions required in MCMC are listed 
in Appendix A. 3. 

4.2. Computational efficiency. The algorithm described above is compu- 
tationally demanding as we have two latent variables to sample at each sam- 
pling site and one spatial parameter for each of the grid cells. However, since 
zp, zo are independent across cells given O, we can update them all at once. 
The problematic part is sampling the spatial effects, with approximately 
37,000 grid cells. To handle this issue, we used a parallelization method 
where D is divided into disjoint and exhaustive subregions Di, D2, ■ . ■ , Dl 
along with a resultant set of boundary cells B arising through the CAR 
proximity matrix. Thus, once Op is updated conditional on the rest, then 
Qdi j 0D 2 > • • • ) ®D L given Op can be updated in parallel. 

This algorithm is illustrated in Figure 5, where we have a 15 x 8 rect- 
angular region with an adjacency structure W which puts weight only on 
the cells sharing a common boundary. Sequential updating would have re- 
quired 120 steps. We constructed a set of 48 boundary cells B (the dark 
cells in Figure 5). It divides the rectangle into 12 segments of 6 cells each, 
so that conditional on 9p, those segments can be updated independently of 
each other so we need only 54 updating steps. This is only an illustrative 
example, but for large regions, this may significantly improve the run time. 
However, the time required for communication and data assimilation is an 
issue for this parallelization method. With increasing L, although the time 
required for the sequential updating within each Di goes down, the size of 
B increases as does the amount of communication required within the par- 
allel architecture. So a trade-off must be determined for choosing L; in our 
setting L = 11 worked well. 

4.3. Posterior summaries. There are several ways to summarize infer- 
ence about the p and r distributions. According to our model, for A4, 
Pih = &(ah — vj p — 6i) — §{cth-i — vff3 — 0i). Posterior samples of /3,#,r 2 
enable us to compute samples of the pi. A posterior sample of r% can be 
constructed using the relation V{ = (1 — u% + UiPi Q , Uipn, UiPi2, UiPis) . Addi- 
tionally, we can calculate the mean as well as the uncertainty from these 
samples, enabling maps for transformed abundance (r) and potential (p) 
abundance. For each of pi and rj, we have 4 submaps, one for each abun- 
dance category. This is useful in terms of assessing high and low abundance 
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Fig. 5. An example grid to illustrate parallel CAR implementation. Normal sequential 
updating would have required 120 steps in each iteration. By dividing the rectangle into 12 
segments of 6 cells each with 48 boundary cells (shown in dark grey), each segment can be 
updated independently (conditional on the boundaries) . In this example the parallelization 
results in only 54 updating steps. 

regions for the species. The /3's provide (up to fixed scale) the effect of a par- 
ticular climate or soil-related factor on the abundance of a particular species. 
Comparison of the p and r maps informs about the effect of land transfor- 
mation. One may also be interested in capturing p or r through a single 
summary feature rather than all 4 categorical probabilities. Grouped mean 
abundance (expectation with respect to the p or r distribution) can be used 
with suitable categorical midpoints. We note that the posterior inference 
can also be summarized on the latent scale using posterior samples of the 
z's. However, working on the z scale can only provide relative comparison. 

5. Data analysis. We have implemented the described model on abun- 
dance data for several different plant species over the whole CFR. We 
centered and scaled all the u's before using them in the model. As pri- 
ors we used 7r(a) = 1, vr(/3) = N(0,(fil) with large <f> = 100. For 9, we used 
t/q = 0.1 and W to be a binary matrix with w(i,j) = 1 iff d(i,j) < 0.30. 



Table 1 

Posterior summaries for covariate effects (mean and 95% c.i. width) 



Species 


Apan.mean 


MaxOl 


Min07 


Mean.an.pr 


Sumsmd 


Fertl 


PRPUNC 


1.2275 


-0.9436 


-0.8248 


0.2439 


0.1834 


0.0306 




(0.3809) 


(0.2768) 


(0.1143) 


(0.1158) 


(0.2006) 


(0.1089) 


PRREPE 


0.6825 


-0.4512 


-0.0864 


0.1753 


-0.2958 


0.0566 




(0.1710) 


(0.1179) 


(0.0612) 


(0.0673) 


(0.0996) 


(0.0455) 
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Fig. 6. Posterior mean spatial effects (6) for Protea punctata (PRPUNC) and Protea 
repens (PRREPE). These effects offer local adjustment to potential abundance. Cells with 
values greater than zero represent regions with larger than expected populations, conditional 
on the other covariates. 



The threshold 0.30 was used to provide an 8 nearest neighbor structure 
for most of the cells. However, for boundary cells, the number of neighbors 
varies from 3 to 6. The parallelization algorithm was implemented inside R 
(http://www.r-project.org) using I = 11. The run time for an individual 
species was about 9000 iterations / day. The outputs presented below are cre- 
ated by first running 12500 iterations of MCMC, discarding the initial 7500 
samples, and thinning the rest at every fifth sample. The /3's were quick to 
converge, but the a sequences were highly autocorrelated and moved more 
slowly in the space. 

Here we consider two species, Protea punctata (PRPUNC) and Protea 
repens (PRREPE). A summary of the model output is presented through 
the following table and diagrams. Table 1 provides the mean covariate effects 
for both species along with the 95% equal tail credible interval width (in 
parentheses). Considering 95% equal tail credible interval, all the covariate 
effects are significant except Fertl for P. punctata. 

The mean posterior spatial effects are shown in Figure 6. Note that the 
spatial effects for the two species have quite different patterns, with Protea 
repens having a region of low values in the northeast and larger values 
elsewhere, while Protea punctata is more even across the landscape, but 
with lower values toward the edges of the CFR. These surfaces capture the 
spatial variability in abundance that is not explained by the other covariates 
within the model. This suggests that the covariates predict higher abundance 
of P. repens in the northwest than what was observed, perhaps indicating 
some unobserved limiting factor (such as unsuitable soils, more extreme 
seasonality in rainfall, or dispersal limitations). Similarly for P. punctata, 
the covariates may over-predict abundances at the edges of the CFR where 
many environmental factors change as one transitions to other biome types. 
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Fig. 7. Abundance category probability maps for Protea punctata (PRPUNC) for un- 
transformed (left) and transformed (right) situations. Values are cellwise posterior mean 
probabilities for the abundance classes. Class means the probability the species is absent, 
while classes 1-3 indicate estimated abundance from 1-10, 11-100, 100+ individuals, re- 
spectively. 



Figures 7 and 8 show the mean posterior abundance category probabil- 
ities (potential and transformed) for P. punctata (Figure 7) and P. repens 
(Figure 8). Comparing these plots among rows contrasts the probabilities 
associated with each abundance class for the species, while comparing be- 
tween columns shows the effects of landscape transformations on abundance 
class probabilities. Both species show higher predicted abundances coincid- 
ing with mountainous areas of the CFR. This is where the fynbos biome 
dominates the landscape and where proteas are characteristically the dom- 
inant, indicator species [Rebelo et al. (2006)]. Note that P. punctata, a less 
common species, is only slightly affected by landscape transformation, while 
there are dramatic differences for P. repens (Figures 9 and 10). This is 
because P. punctata is mostly limited to dry, rocky, or shale slopes [Re- 
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Fig. 8. Abundance category probability maps for Protea repens (PRREPE) for untrans- 
formed (left) and transformed (right) situations. Values are cellwise posterior mean prob- 
abilities for the abundance classes. Class means the probability the species is absent, 
while classes 1-3 indicate estimated abundance from 1-10, 11-100, 1008 individuals, re- 
spectively. 



belo (2001)] which are less suitable for agriculture or development and thus 
mostly untransformed. P. repens, on the other hand, is much more ubiqui- 
tous across the region and can frequently occur in lowland areas that have 
been largely transformed by human activities [Rebelo (2001), Rebelo et al. 
(2006)]. 

It is also useful to summarize these data through mean potential abun- 
dance and mean transformed abundance (see Section 4.3) as in Figures 9 
and 10. These figures allow inspection of the underlying latent surfaces that 
are of interest to ecologists as a continuous relative representation of species 
abundances. However, the latent u z" scales may be difficult to interpret eco- 
logically and, thus, estimated potential and transformed abundance (using 
the grouped mean) are also shown. These represent the expected abundance 



22 



A. CHAKRABORTY ET AL. 



Mean z[P] 



Mean z[T] 



L 



L 




Grouped Mean Abundance 



Grouped Mean Transformed Abundance 



140 

120 

100 

80 

60 

40 

20 





Fig. 9. Mean posterior abundance summaries for Protea punctata (PRPUNC). On the 
latent z-scale, "Mean z[P] " refers to the potential abundance, while "Mean z\T]" refers 
to the potential abundance corrected for habitat transformation. The Grouped Mean Abun- 
dance rescales the Mean z[P] surface to the expected potential size of a population in a 
grid cell (using the observed abundance classes: absent, 1-10, 10-100, 100+). The Group 
Mean Transformed Abundance shows the expected size of a population after correcting for 
habitat transformation. 



(with respect to the p's or r's) of a species at a randomly selected sample 
location in that grid. The associated display makes it easy to visualize the 
effects of habitat transformation on protea populations. P. punctata shows 
almost no effects of landscape transformation, while large differences are 
apparent for P. repens. Note the large transformed regions in the south and 
west where the expected abundance of plants has dropped from more than 
50 to near zero. It is also apparent that, across the landscape, P. punctata 
tends to have a higher expected mean abundance at any given sample point 
than does P. repens (Figures 9 and 10). 

6. Discussion and future work. Building on previous efforts that have 
addressed the presence/absence of species, we have presented a modeling 
framework for learning about potential patterns for species abundance not 
degraded by land transformation and potential measurement error. The 
model was built using a hierarchical latent abundance specification, incor- 
porating spatial structure to capture anticipated association between adja- 
cent locations. Along with potential pattern, we also have an estimate of 
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Fig. 10. Mean posterior abundance summaries for Protea repens (PRREPE). On the 
latent z-scale, "Mean z[P]" refers to the potential abundance, while "Mean z[T]" refers 
to the potential abundance corrected for habitat transformation. The Grouped Mean Abun- 
dance rescales the Mean z[P] surface to the expected potential size of a population in a 
grid cell (using the observed abundance classes: absent, 1-10, 10-100, 100+). The Group 
Mean Transformed Abundance shows the expected size of a population after correcting for 
habitat transformation. 

transformed abundance pattern. Comparison of these two patterns is help- 
ful for understanding the effect of land transformation on species presence 
and abundance and, in particular, for disentangling these effects from those 
of other environmental factors. This may facilitate designing strategies for 
species conservation as well as understanding the overall effects of climate 
change. 

This work has applications in biogeography and in conservation biology 
[Pearce and Ferrier (2001), Gaston (2003)]. We can now develop predictive 
maps of "high quality" habitat sites within a species range, based on high 
predicted abundances. This will help identify prime locations for effective 
conservation efforts. We can also estimate the impact of habitat transfor- 
mation on the size of the population using the information from Figures 8 
and 10, and thus identify threats to conservation. Predictive abundance 
maps will also be useful to explore patterns in biodiversity and species abun- 
dances. Do species abundances tend to peak in the middle of the species' 
range [Gaston (2003)]? Do areas of high biodiversity tend to have lower 
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species abundances? Are there areas that are rich in both abundance and 
biodiversity (perhaps identifying ideal regions for conservation efforts)? 

There are several natural extensions. One is to study the temporal change 
in abundance. With abundance data collected over time as well as associated 
environmental factors such as rainfall and temperature, dynamic modeling 
of species abundance with changing environmental factors may give a clearer 
picture of how a species is responding to climate change. Indeed, when con- 
nected to future climate scenarios, we may attempt to forecast prospective 
species abundance. Similarly, if the transformation data is also time varying, 
we could illuminate the effect of land transformation in greater detail. 

The current model uses transformation percentage (1 — u) in a determin- 
istic way (transformation having a binary effect on potential abundance) . In 
other cases (e.g., to study abundance pattern of animals) it may be reason- 
able to treat transformation as another covariate influencing species habitat. 
Also, it may be imagined that the relationship between potential abundance 
and environmental variables is not linear as specified in equation (3.3), for 
example, environmental variables may affect larger abundance classes differ- 
ently from smaller abundance classes; piecewise linear specification, intro- 
ducing different regression coefficients over the different abundance classes, 
could be explored. 

Another possible extension lies in joint modeling of two or more species. 
One may wish to learn whether two plant varieties are sympatric or al- 
lopatric and whether or not there is evidence for competitive interactions or 
facilitation. Such modeling can be done by extending our model to have mul- 
tiple {zp : k, ZT : k, zo,k) surfaces, where k is the species indicator. Dependence 
can be introduced across zpk surfaces by modeling 9^ using an MCAR 
[Gefand and Vounatsou (2003), Jin, Carlin and Banerjee (2005)]. Fitting 
such models will be very challenging if there are many grid cells. 

Instead of taking an areal level approach, if covariate information is avail- 
able at point level (where sampling sites are viewed as "points" within the 
large region, D), one may consider a point-level model. This amounts to re- 
placing the CAR model with a Gaussian process prior for the spatial effects. 
With many sampling sites, we will need to use appropriate approximation 
techniques [Banerjee et al. (2008)]. 



APPENDIX 

A.l. Proof of E(z T ) finite. E(z T ) = E(E(z T \z P )) = E(uz P + (1 - u) x 




oo 1— Q(x) 



Assuming zp ~ N(fi,l), it is enough to 
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Consider the quantity x 2 fz^h <fi(x — fi), if x — > — oo, it goes to 0. When 
x — > oo, we have 

2 <f>(x) 

hm x - ^T^Pix ~ N 

x->oo l-$(x) 

L'ptai ,. 2x(j)(x)(/)(x — a) — x 3 (j)(x)(j)(x — a) — x 2 (x — /xWxWx — /i) 
= hm 



-4>(x) 



So limi^i^oo x 2 it$( x ) 4>( x ~ A*) = 0, thus, we can get B\ < 0, E>2 > 0, such that 
<M X — A*) < J? f° r an x £ (BiiBz). Hence, the result follows. 

A.2. Posterior simulation of z's for a site with no presence observed. 

We subdivide by considering the ways that we can generate a realization 
of y based on Equation (3.5) [one may also use Equation (3.3) to do this]: 

(i) The area is untransformed, the species was potentially there, but missed 
during data collection or it was absent at that time instance; the event 
is l Z p>a ,zo<ao with prior probability itj = uP(zp > ao,zo < «o)- 

(ii) Potentially the species was absent there; the event is l Zp <a with prior 
probability 7T2 = P(zp < ato). 

(hi) The species was potentially there l^p>«o! but the area was transformed; 
the event has prior probability tt^ = (1 — u)P{zp > oiq). 

These three events are exhaustive and mutually exclusive for the event 
(y = 0). Thus, f(zp, Zo\y = 0, 0) is a 3-component mixture. To draw a 
(zp,zo) pair from this distribution amounts to first choosing a component 
and then drawing a pair (zp,zo) from that component distribution. By 
Bayes' rule, conditional on observed (y = 0), these three cases can happen 
with posterior probability 7r.;/ J^f=i n h i = 1,2,3. So we use a multinomial 
to select which of these events took place. Before going into case by case 
details, it is worth mentioning that in all these cases the sampling from 
the joint density of chosen mixture component was done via the marginal 
f(zp\ ■ •) followed by f(zo\zp,- •). The advantage of this scheme is that we 
don't need to draw from the latter because zo's corresponding to y = are 
not involved in posterior full conditionals of any other parameters in the 
model (as ao = 0, fixed). If the second case is selected, then f(zo,Zp\-, •) oc 
[u5 Zp + (1 -u)5 c ( Zp )]l Zo <ol Zp <o(l){zp) and thus marginalizing over z , we get 
f(zp\ ■ ■) oc 4>(zp)l Zp <o which is a truncated Gaussian on R~. Similarly un- 
der case (iii), we need to simulate zp from 4>(zp)l Zp >o, a Gaussian truncated 
on R + . In case (i), f(zo,zp\ ■ •) oc 4>(zo;zp, l)l Zp > l zo <o(j)(zp), so marginal- 
izing over zo we get f(zp\ • •) oc (p(zp)(l — &(zp))l Zp >o. An efficient way to 
draw from this density is to propose & zp from a truncated normal on M + 
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and do a Metropolis-Hastings update with an independent proposal, using 
the quantity (1 — <&(•))• However, all sampling distributions are summarized 
in Appendix A. 3 below. 

A.3. Posterior full conditionals needed for Gibbs sampling. 

• If yij > 0, draw z ,ij ~ N(z Pjij , l)l( av ,._ uay ,.y Draw z P ^ ~ N( Vi ^ + 

z o,jj 1\ 
2 ' 2>- 

• If yij = 0, compute^ = (u$ 2 ([0, oo] x [-oo, 0]; S ), 1 - $(vf (3 + 6), (1 - 
u)<S>(vf(3 + 0i)), where % = (vf (3 + h vJ ft + 6»,) and S = {\ \) are 
the location and dispersion parameters for bivariate normal joint prior 
distribution of {zo,ij,zp t ij). Draw dij '~ ' mult(pjj). If dij = 1, propose 
^propose ^ N^yjp _|_ q.^ i)i^ ^ anc l d a Metropolis-Hastings sampler us- 
ing (1 — <£(•)). Else if dij = 2, draw zp^j ~ N(vf (3 + 9i, l)l(_oo,o)) e ^ se draw 
z Ptij ~N(vTp + 9i, l)l ( o,oo) • 

• Draw q?/j = unif^axy:^^^ ,,. ; /, . | zo.ij), h=l,2. 

• Draw ,3- iV(^, S^) fly N(z P>ij ; v h [3, 0*). 

. Draw Oi ~ N(z P>ij ;vi,P, 9i)N(^^, ^-) for i = 1, 2, . . . , m. Draw ^ ~ 
N psy B i A.) for i = m+l,2,...,7. 
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